RFC: kernel compile separation + mixed-TP + MX client adoption (post-#2389) #2652
Draft
KavinKrishnan wants to merge 5 commits into
…le, mixed-TP, MX clients

Proposes the next phase of work on top of `nixl_mx` once PrimeIntellect-ai#2389 merges:

1. Phase-1: six surgical fixes against the in-tree code that close the bug classes we hit during GB200 bring-up (cross-subnet add_remote_agent full-mesh; stale READY peer dedup; heartbeat / STALE-on-shutdown; hardcoded 1200s timeout; non-MLA model guard; HSDP barrier ordering). Line-pinned against HEAD `79ea824d8`.
2. Phase-2: graduate `src/prime_rl/transport/mx_rendezvous.py` onto NVIDIA's published `modelexpress` Python clients (`MxV2TrainingPublisher` / `MxV2RefitReceiver`). Deletes ~185 LOC of in-tree rendezvous that duplicates the upstream client. Inherits heartbeat + freshest-per-rank dedup + retention + sidecar-filter for free. `NixlAgentWrapper` / `Slot` / `TransportPlan` / `classic_cuda_pool` stay; those are prime-rl specialization.
3. Phase-3: solves the trainer-side kernel-compile issue surfaced during PrimeIntellect-ai#2389's FP8 cast-pipeline iteration. Trainer publishes HF-raw bytes (kernel-agnostic); inference compiles into its target layout (DeepGemm, cutlass, ...) via a receiver-side scratch-buffer pass. Extends the v2 shape registry with `compile_target` + `compile_metadata`. Heterogeneous fleets (mixed kernels on the same training run) now work without trainer-side branching.
4. Phase-4: generalizes the v2 sharding metadata to handle mixed-TP/EP via `TargetTPLayout` + multi-source slice discovery, in the same machinery NemoRL v2 uses for MoE expert filtering.

Pulls heavily on the NemoRL × Dynamo path (NVIDIA, John Thompson), which is already running at 380 Gbps on GB300 RoCE for an 8.82 GB refit. This plan adopts the same scratch-buffer + worker-extension-cls pattern.

Component + per-refit sequence diagrams (mermaid) included. Estimated ~450 LOC additive across modelexpress + prime-rl for Phases 3-4 (plus the ~400 LOC subtraction from Phase 2).

Doc only. Implementation phases sequenced behind the upstream merge of PrimeIntellect-ai#2389.
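The "stale READY peer dedup" and "freshest-per-rank dedup" items above can be illustrated with a minimal sketch. This is not the `modelexpress` client's implementation; `PeerRecord`, its fields, and the 30 s heartbeat TTL are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerRecord:
    """Hypothetical rendezvous record; field names are assumptions."""
    rank: int
    agent_id: str
    status: str            # e.g. "READY" or "STALE"
    published_at: float    # seconds, when the record was published
    last_heartbeat: float  # seconds, last time the peer checked in


def freshest_ready_per_rank(records, now, heartbeat_ttl_s=30.0):
    """Keep at most one READY record per rank: the most recently published
    one whose heartbeat is still within the TTL."""
    best = {}
    for rec in records:
        if rec.status != "READY":
            continue  # peer marked itself STALE (e.g. on shutdown)
        if now - rec.last_heartbeat > heartbeat_ttl_s:
            continue  # READY but silent: treat as a dead leftover
        current = best.get(rec.rank)
        if current is None or rec.published_at > current.published_at:
            best[rec.rank] = rec
    return best
```

With this shape, a crashed worker that left a READY record behind stops matching once its heartbeat ages out, so a waiter counts one live peer per rank rather than every record ever published.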
… (v0.7.x)

Captures the empirical findings from baking PRs #1 and #2 into an ARM64 GB200 image and running it on the kavin namespace for 8+ hours on Qwen3-30B-A3B-Instruct-2507 with gsm8k.

Documents three real surprises the unit tests didn't cover:

1. Dockerfile.cuda's `uv sync` is missing `--extra disagg`, so modelexpress isn't installed in stock images; inference workers crash at the first import. Shipped v0.7.1 as a one-line overlay that adds the extra until the upstream Dockerfile.cuda can be updated.
2. `LD_PRELOAD` path for libcudart.so.12: v0.5.2 had /usr/local/cuda present in the final stage; v0.7.0 (built from upstream Dockerfile.cuda as-is) doesn't. The pip-installed wheel path (/app/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib/) is the new canonical location.
3. The configmap monkeypatch (patch_nixl_mx.py) and Phase 2's source-baked fixes are complementary: they patch different layers (broadcast vs rendezvous-wait), and both should stay until PR #1 merges upstream.

Build experience numbers:
- v0.7.0 from-scratch ARM64 build under QEMU: 6h45min (uv sync 45m, flash-attn from source 3h45m).
- v0.7.1 overlay on top of v0.7.0: ~3 min.

Cluster observations from v0.5.2 + configmap monkeypatch (the runtime-patched path our PR #1 codifies into source):
- 183 successful RL refit cycles in one 66-min uninterrupted window
- Reward variance 0.5-1.0 across orchestrator steps (real learning)
- Off-policy level = 0 throughout
- Zero NIXL data-plane errors
- Recurring orchestrator wait_for_all_peers_ready timeout (~once per 30-66 min) is the exact bug class Phase 2's rendezvous-level dedup eliminates

Also notes seven RFC updates queued in pensieve/RL/PrimeRL/09_rfc_updates_needed.md, three of which are new from this build experience (disagg extra, LD_PRELOAD path, vLLM PR #43375 / Anyscale RDT positioning).

Companion to the RFC at docs/proposals/post-pr2389-kernel-compile-plan.md.
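Because the `libcudart.so.12` location moved between image versions, one way to avoid hardcoding the `LD_PRELOAD` path is to probe a list of candidate directories at startup. The following is a minimal sketch under stated assumptions: `find_libcudart` is a hypothetical helper, and the `lib64` subdirectory under /usr/local/cuda is an assumption about the older image layout, not something the build notes confirm.

```python
from pathlib import Path

# Candidate directories, newest layout first. The wheel path comes from the
# build notes; the /usr/local/cuda/lib64 subdirectory is an assumption.
DEFAULT_DIRS = [
    "/app/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib",
    "/usr/local/cuda/lib64",
]


def find_libcudart(candidate_dirs=DEFAULT_DIRS, name="libcudart.so.12"):
    """Return the first existing path to the runtime library, or None."""
    for directory in candidate_dirs:
        path = Path(directory) / name
        if path.is_file():
            return str(path)
    return None


if __name__ == "__main__":
    found = find_libcudart()
    # An entrypoint script could export this value as LD_PRELOAD before
    # launching the worker process.
    print(found if found else "libcudart.so.12 not found")
```

Failing loudly when the helper returns `None` is probably preferable to silently launching without the preload.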
…/3/4 upstream form

vLLM published https://vllm.ai/blog/2026-05-28-native-rl-apis the same day, announcing a standardized WeightTransferEngine abstract base + 4-phase lifecycle (init / start / update / finish) + a pluggable WeightTransferEngineFactory.register_engine(...) extension point.

This is the upstream integration seam that the in-tree MxRendezvous reimplementation in PR PrimeIntellect-ai#2389 and the worker_extension_cls injection in inference/vllm/worker/nixl_mx.py have been emulating. The cleanest form of all our Phase 2/3/4 work upstream is a single MxWeightTransferEngine adapter (~150-200 LOC) that subclasses WeightTransferEngine and wraps the existing MxV2RefitReceiver + MxV2TrainingPublisher.

Three immediate consequences captured in §8:

§8.1: Phase 2/3/4 should be repackaged as MxWeightTransferEngine for upstream contribution; the existing patches stay correct, the packaging just becomes upstream-native.

§8.2: The blog credits Matej Sirovatka specifically. He's likely mid-flight on a native-APIs rewrite of prime-rl's nixl_mx broadcast. Ask him before pushing Phase 2 upstream; the work may retarget to the adapter path directly.

§8.3: Their validation was at 16x 8xH200, DPEP32, 256 GPUs total. That scale makes Phase 4's multi-source slice planning load-bearing (mixed-TP/EP is the common case), not optional. Validates the design direction and sets the next cluster validation target after the DP=4 kavin smoke.

§8.4: pause_generation(mode="keep") + two-phase DPEP pause are features we don't yet match. Keep mode unlocks true async RL; queue after Phase 2 lands.

Updated follow-up list grows from 4 to 7 items, with the three new ones being: write MxWeightTransferEngine, adopt keep-mode pause in the orchestrator, and coordinate with Robert Shaw / the vLLM RL roadmap on the K8s-native weight transfer engine they mention as ongoing work (which describes MX itself, modulo who's driving the upstream PR).
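To make the adapter idea concrete, here is a rough sketch of a four-phase lifecycle wrapper. It does not use vLLM's real classes: `FourPhaseEngine` is a hypothetical stand-in for the abstract base (only the phase names init / start / update / finish come from the text above), the method signatures are assumptions, and `FakeReceiver` stands in for a receiver client whose actual API is not shown here.

```python
from abc import ABC, abstractmethod


class FourPhaseEngine(ABC):
    """Hypothetical stand-in; NOT vLLM's actual WeightTransferEngine API."""

    @abstractmethod
    def init(self, config: dict) -> None: ...

    @abstractmethod
    def start(self, step: int) -> None: ...

    @abstractmethod
    def update(self, tensor_names: list) -> int: ...

    @abstractmethod
    def finish(self) -> None: ...


class MxEngineSketch(FourPhaseEngine):
    """Wraps a receiver object and enforces the init/start/update/finish order."""

    def __init__(self, receiver):
        self._receiver = receiver
        self._phase = "created"
        self._step = None

    def _expect(self, *allowed):
        if self._phase not in allowed:
            raise RuntimeError(f"called in phase {self._phase!r}, expected {allowed}")

    def init(self, config):
        self._expect("created")
        self._receiver.connect(config["endpoint"])
        self._phase = "idle"

    def start(self, step):
        self._expect("idle")
        self._step = step
        self._phase = "updating"

    def update(self, tensor_names):
        self._expect("updating")
        return self._receiver.receive(self._step, tensor_names)

    def finish(self):
        self._expect("updating")
        self._step = None
        self._phase = "idle"


class FakeReceiver:
    """Illustrative receiver; connect/receive are assumed method names."""

    def __init__(self):
        self.calls = []

    def connect(self, endpoint):
        self.calls.append(("connect", endpoint))

    def receive(self, step, tensor_names):
        self.calls.append(("receive", step, tuple(tensor_names)))
        return len(tensor_names)
```

The explicit phase check is a design choice for the sketch: it turns an out-of-order call (for example `update` before `start`) into an immediate error instead of a silent partial refit.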
…three docs
The three proposal docs now form a coherent set:
- post-pr2389-status-and-plan.md: executive summary; failure-class-to-fix mapping; mermaid diagram of the data + metadata planes; Phase 0 unblock guidance
- post-pr2389-kernel-compile-plan.md: full RFC with phase-by-phase design rationale (unchanged except for cross-link header)
- build-notes-2026-05-28.md: operational findings from the source-baked image build, plus the vLLM native RL APIs reframe in section 8
Each doc now has a header block linking to the other two so readers
can navigate based on intent (status vs design vs operational).
The status-and-plan doc is the natural entry point for someone coming
to the work cold; the RFC and build-notes are the deep dives.
Summary
This is a doc-only RFC layered on top of #2389. It proposes the next phase of work once #2389 merges to `main`:

1. Phase 1: six surgical fixes against the in-tree `nixl_mx` code (close the bug classes we hit during GB200 bring-up: cross-subnet `add_remote_agent` full-mesh, stale `READY` peer dedup, heartbeat / STALE-on-shutdown, hardcoded 1200 s timeout, non-MLA model guard for `update_mla_absorbed_weights`, HSDP barrier ordering).
2. Phase 2: graduate `src/prime_rl/transport/mx_rendezvous.py` (~185 LOC of in-tree rendezvous code) onto NVIDIA's published `modelexpress` Python clients (`MxV2TrainingPublisher` / `MxV2RefitReceiver`). Inherits heartbeat + freshest-per-rank dedup + retention + the v2 sidecar filter (modelexpress PR #295) for free. The in-tree `NixlAgentWrapper`, `Slot`, `TransportPlan`, and `classic_cuda_pool` stay untouched; that's prime-rl-specific data-plane specialization.
3. Phase 3: the trainer publishes HF-raw bytes and inference compiles into its target layout; the v2 shape registry gains a `compile_target` + `compile_metadata` field so receivers filter on compatibility. Heterogeneous fleets (DeepGemm and cutlass on the same training run) now work without trainer-side branching.
4. Phase 4: generalize the v2 sharding metadata to mixed-TP/EP via `TargetTPLayout` + multi-source slice discovery. Same machinery as our NemoRL v2 MoE expert filtering, generalized to dense matmul axes.

The plan pulls heavily on the NemoRL × Dynamo path (NVIDIA, @jthomson04), which is already running cross-node at 380 Gbps on GB300 RoCE for an 8.82 GB / 399-tensor refit on Qwen3-4B-Thinking-2507. This plan adopts the same scratch-buffer + `worker_extension_cls` pattern.

What's in this PR
Doc only. 517 lines at `docs/proposals/post-pr2389-kernel-compile-plan.md`. Includes component + per-refit sequence diagrams (mermaid). No code changes; implementation phases sequence behind this RFC's acceptance.

Why a draft RFC against `nixl_mx`

The plan only makes sense in the context of this branch's code. Targeting `main` now would dangle (no `nixl_mx` to build on). Plan: re-target to `main` once #2389 merges, then land Phase 1 quickly as a follow-up PR.

Estimated impact

… `mx_rendezvous.py` deleted) + 150 (import-and-call)

Total ~450 LOC additive for Phases 3-4, plus the ~400 LOC subtraction from Phase 2, which reduces maintenance burden.
Test plan
N/A — doc only. Each implementation phase ships its own test plan in the doc (see §8). Phase 3 validation piggybacks on the existing NemoRL+Dynamo GB300 cluster to de-risk the compile-pass design before porting into the prime-rl worker.
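As an illustration of what the Phase 3 compile-pass validation would exercise, here is a minimal sketch of receiver-side filtering on `compile_target`. The `SourceEntry` type, the `filter_sources` helper, and the string values `"hf_raw"`, `"deepgemm"`, and `"cutlass"` are assumptions for illustration; the RFC names the `compile_target` and `compile_metadata` fields but this is not the registry's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class SourceEntry:
    """Hypothetical registry entry carrying the two proposed fields."""
    source_id: str
    compile_target: str                       # e.g. "hf_raw", "deepgemm", "cutlass"
    compile_metadata: dict = field(default_factory=dict)


def filter_sources(entries, accepted_targets):
    """Return only the entries whose compile_target this receiver can consume."""
    accepted = set(accepted_targets)
    return [e for e in entries if e.compile_target in accepted]


entries = [
    SourceEntry("trainer-0", "hf_raw"),
    SourceEntry("cache-dg", "deepgemm", {"block": 128}),
    SourceEntry("cache-ct", "cutlass"),
]

# A DeepGemm receiver can take kernel-agnostic bytes (and compile locally)
# or bytes already laid out for DeepGemm, but not cutlass-layout bytes.
usable = filter_sources(entries, {"hf_raw", "deepgemm"})
print([e.source_id for e in usable])
```

Since the trainer only ever publishes the kernel-agnostic form in this design, the filter is what lets receivers with different kernels share one training run without trainer-side branching.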
Note
Low Risk
Documentation only; no production code, config, or transport behavior changes in this PR.
Overview
Adds `docs/proposals/post-pr2389-kernel-compile-plan.md`, a doc-only RFC (~517 lines) for work after PR #2389 lands. It does not change runtime code.

The proposal keeps the existing `nixl_mx` data plane (`Slot`, `TransportPlan`, `NixlAgentWrapper`, pools) and plans rendezvous/metadata extensions only:

- Replace `MxRendezvous` with ModelExpress `MxV2TrainingPublisher` / `MxV2RefitReceiver`; adopt `worker_extension_cls` on the vLLM worker.
- `CompilePass` (`hf_raw`, DeepGemm, cutlass); extend the v2 registry with `compile_target` / `compile_metadata` and `compile_target_filter` on discovery.
- `TargetTPLayout`, slice-aware `receive_weights_scratch`, and multi-source `discover_v2_sources_for_slice`.

The doc includes mermaid architecture/sequence diagrams, phased LOC estimates, open questions, and links to ModelExpress/NemoRL validation paths (e.g. scratch refit on GB300).
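The multi-source slice discovery mentioned above reduces, for a single evenly sharded axis, to interval intersection between the trainer's shards and the receiver's shard. The sketch below is an assumption-laden illustration: `shard_range` and `plan_sources` are hypothetical helpers, and it covers only even sharding along one dimension, not the full `TargetTPLayout` design.

```python
def shard_range(dim, tp, rank):
    """Half-open [lo, hi) range of `dim` owned by `rank` under even TP sharding."""
    if dim % tp != 0:
        raise ValueError("sketch assumes dim is divisible by the TP degree")
    size = dim // tp
    return rank * size, (rank + 1) * size


def plan_sources(dim, src_tp, dst_tp, dst_rank):
    """List the source shards a destination rank must read from.

    Each tuple is (src_rank, src_start, src_stop, dst_offset): copy
    src_shard[src_start:src_stop] into the destination shard at dst_offset.
    """
    dst_lo, dst_hi = shard_range(dim, dst_tp, dst_rank)
    plan = []
    for src_rank in range(src_tp):
        src_lo, src_hi = shard_range(dim, src_tp, src_rank)
        lo, hi = max(dst_lo, src_lo), min(dst_hi, src_hi)
        if lo < hi:  # the two shards overlap
            plan.append((src_rank, lo - src_lo, hi - src_lo, lo - dst_lo))
    return plan


# Trainer at TP=4, inference at TP=2: each receiver rank pulls from two sources.
print(plan_sources(dim=8, src_tp=4, dst_tp=2, dst_rank=1))
# Trainer at TP=2, inference at TP=4: each receiver rank pulls part of one source.
print(plan_sources(dim=8, src_tp=2, dst_tp=4, dst_rank=1))
```

When the trainer's TP degree is higher than the receiver's, every receiver rank has more than one source, which is why discovery has to return a list of sources per slice rather than a single peer.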
Reviewed by Cursor Bugbot for commit 7feee0d.